{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "polyglot\n",
    "==============================="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Downloads](https://img.shields.io/pypi/dm/polyglot.svg \"Downloads\")](https://pypi.python.org/pypi/polyglot)\n",
    "[![Latest Version](https://badge.fury.io/py/polyglot.svg \"Latest Version\")](https://pypi.python.org/pypi/polyglot)\n",
    "[![Build Status](https://travis-ci.org/aboSamoor/polyglot.png?branch=master \"Build Status\")](https://travis-ci.org/aboSamoor/polyglot)\n",
    "[![Documentation Status](https://readthedocs.org/projects/polyglot/badge/?version=latest \"Documentation Status\")](https://readthedocs.org/builds/polyglot/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Polyglot is a natural language pipeline that supports massive multilingual applications.\n",
    "\n",
    "* Free software: GPLv3 license\n",
    "* Documentation: http://polyglot.readthedocs.org.\n",
    "\n",
    "###Features\n",
    "\n",
    "\n",
    "* Tokenization (165 Languages)\n",
    "* Language detection (196 Languages)\n",
    "* Named Entity Recognition (40 Languages)\n",
    "* Part of Speech Tagging (16 Languages)\n",
    "* Sentiment Analysis (136 Languages)\n",
    "* Word Embeddings (137 Languages)\n",
    "* Morphological analysis (135 Languages)\n",
    "* Transliteration (69 Languages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Developer\n",
    "\n",
    "* Rami Al-Rfou @ `rmyeid gmail com`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Quick Tutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import polyglot\n",
    "from polyglot.text import Text, Word"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Language Detection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Language Detected: Code=fr, Name=French\n",
      "\n"
     ]
    }
   ],
   "source": [
    "text = Text(\"Bonjour, Mesdames.\")\n",
    "print(\"Language Detected: Code={}, Name={}\\n\".format(text.language.code, text.language.name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']\n"
     ]
    }
   ],
   "source": [
    "zen = Text(\"Beautiful is better than ugly. \"\n",
    "           \"Explicit is better than implicit. \"\n",
    "           \"Simple is better than complex.\")\n",
    "print(zen.words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Sentence(\"Beautiful is better than ugly.\"), Sentence(\"Explicit is better than implicit.\"), Sentence(\"Simple is better than complex.\")]\n"
     ]
    }
   ],
   "source": [
    "print(zen.sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part of Speech Tagging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Word            POS Tag\n",
      "------------------------------\n",
      "O               DET\n",
      "primeiro        ADJ\n",
      "uso             NOUN\n",
      "de              ADP\n",
      "desobediência   NOUN\n",
      "civil           ADJ\n",
      "em              ADP\n",
      "massa           NOUN\n",
      "ocorreu         ADJ\n",
      "em              ADP\n",
      "setembro        NOUN\n",
      "de              ADP\n",
      "1906            NUM\n",
      ".               PUNCT\n"
     ]
    }
   ],
   "source": [
    "text = Text(u\"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.\")\n",
    "\n",
    "print(\"{:<16}{}\".format(\"Word\", \"POS Tag\")+\"\\n\"+\"-\"*30)\n",
    "for word, tag in text.pos_tags:\n",
    "    print(u\"{:<16}{:>2}\".format(word, tag))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Named Entity Recognition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[I-LOC([u'Gro\\xdfbritannien']), I-PER([u'Gandhi'])]\n"
     ]
    }
   ],
   "source": [
    "text = Text(u\"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden\")\n",
    "print(text.entities)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Polarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Word            Polarity\n",
      "------------------------------\n",
      "Beautiful        0\n",
      "is               0\n",
      "better           1\n",
      "than             0\n",
      "ugly            -1\n",
      ".                0\n"
     ]
    }
   ],
   "source": [
    "print(\"{:<16}{}\".format(\"Word\", \"Polarity\")+\"\\n\"+\"-\"*30)\n",
    "for w in zen.words[:6]:\n",
    "    print(\"{:<16}{:>2}\".format(w, w.polarity))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Neighbors (Synonms) of Obama\n",
      "------------------------------\n",
      "Bush            \n",
      "Reagan          \n",
      "Clinton         \n",
      "Ahmadinejad     \n",
      "Nixon           \n",
      "Karzai          \n",
      "McCain          \n",
      "Biden           \n",
      "Huckabee        \n",
      "Lula            \n",
      "\n",
      "\n",
      "The first 10 dimensions out the 256 dimensions\n",
      "\n",
      "[-2.57382345  1.52175975  0.51070285  1.08678675 -0.74386948 -1.18616164\n",
      "  2.92784619 -0.25694436 -1.40958667 -2.39675403]\n"
     ]
    }
   ],
   "source": [
    "word = Word(\"Obama\", language=\"en\")\n",
    "print(\"Neighbors (Synonms) of {}\".format(word)+\"\\n\"+\"-\"*30)\n",
    "for w in word.neighbors:\n",
    "    print(\"{:<16}\".format(w))\n",
    "print(\"\\n\\nThe first 10 dimensions out the {} dimensions\\n\".format(word.vector.shape[0]))\n",
    "print(word.vector[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Morphology"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[u'Pre', u'process', u'ing']\n"
     ]
    }
   ],
   "source": [
    "word = Text(\"Preprocessing is an essential step.\").words[0]\n",
    "print(word.morphemes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transliteration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "препрокессинг\n"
     ]
    }
   ],
   "source": [
    "from polyglot.transliteration import Transliterator\n",
    "transliterator = Transliterator(source_lang=\"en\", target_lang=\"ru\")\n",
    "print(transliterator.transliterate(u\"preprocessing\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}